<!DOCTYPE html>

<html>
<head>
<meta charset="UTF-8">
<link href="style.css" type="text/css" rel="stylesheet">
<title>VDBPSADBW—Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes </title></head>
<body>
<h1>VDBPSADBW—Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes</h1>
<table>
<tr>
<th>Opcode/Instruction</th>
<th>Op /En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th></tr>
<tr>
<td>
<p>EVEX.NDS.128.66.0F3A.W0 42 /r ib</p>
<p>VDBPSADBW xmm1 {k1}{z}, xmm2, xmm3/m128, imm8</p></td>
<td>FVM</td>
<td>V/V</td>
<td>
<p>AVX512VL</p>
<p>AVX512BW</p></td>
<td>Compute packed SAD word results of unsigned bytes in dword block from xmm2 with unsigned bytes of dword blocks transformed from xmm3/m128 using the shuffle controls in imm8. Results are written to xmm1 under the writemask k1.</td></tr>
<tr>
<td>
<p>EVEX.NDS.256.66.0F3A.W0 42 /r ib</p>
<p>VDBPSADBW ymm1 {k1}{z}, ymm2, ymm3/m256, imm8</p></td>
<td>FVM</td>
<td>V/V</td>
<td>
<p>AVX512VL</p>
<p>AVX512BW</p></td>
<td>Compute packed SAD word results of unsigned bytes in dword block from ymm2 with unsigned bytes of dword blocks transformed from ymm3/m256 using the shuffle controls in imm8. Results are written to ymm1 under the writemask k1.</td></tr>
<tr>
<td>
<p>EVEX.NDS.512.66.0F3A.W0 42 /r ib</p>
<p>VDBPSADBW zmm1 {k1}{z}, zmm2, zmm3/m512, imm8</p></td>
<td>FVM</td>
<td>V/V</td>
<td>AVX512BW</td>
<td>Compute packed SAD word results of unsigned bytes in dword block from zmm2 with unsigned bytes of dword blocks transformed from zmm3/m512 using the shuffle controls in imm8. Results are written to zmm1 under the writemask k1.</td></tr></table>
<h3>Instruction Operand Encoding</h3>
<table>
<tr>
<td>Op/En</td>
<td>Operand 1</td>
<td>Operand 2</td>
<td>Operand 3</td>
<td>Operand 4</td></tr>
<tr>
<td>FVM</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>Imm8</td></tr></table>
<p><strong>Description</strong></p>
<p>Compute packed SAD (sum of absolute differences) word results of unsigned bytes from two 32-bit dword elements. Packed SAD word results are calculated in multiples of qword superblocks, producing 4 SAD word results in each 64-bit superblock of the destination register.</p>
<p>Within each super block of packed word results, the SAD results from two 32-bit dword elements are calculated as follows:</p>
<p>The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The destination operand is conditionally updated based on writemask k1 at 16-bit word granularity.</p>
<table>
<tr>
<td>
<p>127+128*n</p>
<p>128-bit Lane of Src2</p></td>
<td>
<p>95+128*n</p>
<p>imm8 shuffle control</p></td>
<td>
<p>63+128*n</p>
<p>0</p></td>
<td>
<p>31+128*n</p>
<p>00B: DW0 </p></td></tr></table></body></html></html><html><body><table><tr><td><p>01B: DW1 </p></td></tr></table></body></html></html><html><body><table><tr><td><p>10B: DW2 11B: DW3</p></td>
<td>128*n</td></tr>
<tr>
<td>
<p>127+128*n</p>
<p>128-bit Lane of Tmp1</p>
<p>24</p>
<p>32</p>
<p>_</p>
<p>abs</p>
<p>+</p>
<p>47</p>
<p>63</p>
<p>_</p>
<p>abs</p></td>
<td>
<p>95+128*n</p>
<p>Tmp1 sliding dword</p>
<p>Src1 stationary dword 1</p>
<p>16</p>
<p>32</p>
<p>_</p>
<p>abs</p>
<p>+</p></td>
<td>
<p>63+128*n</p>
<p>31</p>
<p>23</p>
<p>_</p>
<p>abs</p>
<p>Tmp1 sliding dword</p>
<p>Src1 stationary dword 1</p></td>
<td>
<p>31+128*n</p>
<p>Tmp1 qword superblock</p>
<p>8</p>
<p>0</p>
<p>_</p>
<p>abs</p>
<p>+</p>
<p>15</p>
<p>15</p>
<p>_</p>
<p>abs</p>
<p>+</p></td>
<td>
<p>128*n</p>
<p>Tmp1 sliding dword</p>
<p>Src1 stationary dword 0</p>
<p>0</p>
<p>Tmp1 sliding dword</p>
<p>0</p>
<p>Src1 stationary dword 0</p>
<p>_</p>
<p>abs</p></td></tr>
<tr>
<td>
<p>63</p>
<p>Destination qword superblock</p></td>
<td>47</td>
<td>31</td>
<td>15</td>
<td>0</td></tr></table>
<table>
<tr>
<td>DW3</td>
<td>DW2</td>
<td>DW1</td>
<td>DW0</td></tr></table>
<h3>Figure 5-8.  64-bit Super Block of SAD Operation in VDBPSADBW</h3>
<p><strong>Operation</strong></p>
<p><strong>VDBPSADBW (EVEX encoded versions)</strong></p>
<p>(KL, VL) = (8, 128), (16, 256), (32, 512)</p>
<p>Selection of quadruplets:</p>
<p>FOR I = 0 to VL step 128</p>
<p>TMP1[I+31:I] (cid:197) select (SRC2[I+127: I], imm8[1:0])</p>
<p>TMP1[I+63: I+32] (cid:197) select (SRC2[I+127: I], imm8[3:2])</p>
<p>TMP1[I+95: I+64] (cid:197) select (SRC2[I+127: I], imm8[5:4])</p>
<p>TMP1[I+127: I+96](cid:197) select (SRC2[I+127: I], imm8[7:6])</p>
<p>END FOR</p>
<p>SAD of quadruplets:</p>
<p>FOR I =0 to VL step 64</p>
<p>TMP_DEST[I+15:I] (cid:197) ABS(SRC1[I+7: I] - TMP1[I+7: I]) +</p>
<p>ABS(SRC1[I+15: I+8]- TMP1[I+15: I+8]) +</p>
<p>ABS(SRC1[I+23: I+16]- TMP1[I+23: I+16]) +</p>
<p>ABS(SRC1[I+31: I+24]- TMP1[I+31: I+24])</p>
<p>TMP_DEST[I+31: I+16] (cid:197)ABS(SRC1[I+7: I] - TMP1[I+15: I+8]) +</p>
<p>ABS(SRC1[I+15: I+8]- TMP1[I+23: I+16]) +</p>
<p>ABS(SRC1[I+23: I+16]- TMP1[I+31: I+24]) +</p>
<p>ABS(SRC1[I+31: I+24]- TMP1[I+39: I+32])</p>
<p>TMP_DEST[I+47: I+32] (cid:197)ABS(SRC1[I+39: I+32] - TMP1[I+23: I+16]) +</p>
<p>ABS(SRC1[I+47: I+40]- TMP1[I+31: I+24]) +</p>
<p>ABS(SRC1[I+55: I+48]- TMP1[I+39: I+32]) +</p>
<p>ABS(SRC1[I+63: I+56]- TMP1[I+47: I+40])</p>
<p>TMP_DEST[I+63: I+48] (cid:197)ABS(SRC1[I+39: I+32] - TMP1[I+31: I+24]) +</p>
<p>ABS(SRC1[I+47: I+40] - TMP1[I+39: I+32]) +</p>
<p>ABS(SRC1[I+55: I+48] - TMP1[I+47: I+40]) +</p>
<p>ABS(SRC1[I+63: I+56] - TMP1[I+55: I+48])</p>
<p>ENDFOR</p>
<p>FOR j (cid:197) 0 TO KL-1</p>
<p>i (cid:197)  j * 16</p>
<p>IF k1[j] OR *no writemask*</p>
<p>THEN DEST[i+15:i] (cid:197) TMP_DEST[i+15:i]</p>
<p>ELSE</p>
<p>IF *merging-masking*</p>
<p>; merging-masking</p>
<p>THEN *DEST[i+15:i] remains unchanged*</p>
<p>ELSE</p>
<p>; zeroing-masking</p>
<p>DEST[i+15:i] (cid:197) 0</p>
<p>FI</p>
<p>FI;</p>
<p>ENDFOR</p>
<p>DEST[MAX_VL-1:VL] (cid:197) 0</p>
<p><strong>Intel C/C++ Compiler Intrinsic Equivalent</strong></p>
<p>VDBPSADBW __m512i _mm512_dbsad_epu8(__m512i a, __m512i b);</p>
<p>VDBPSADBW __m512i _mm512_mask_dbsad_epu8(__m512i s, __mmask32 m, __m512i a, __m512i b);</p>
<p>VDBPSADBW __m512i _mm512_maskz_dbsad_epu8(__mmask32 m, __m512i a, __m512i b);</p>
<p>VDBPSADBW __m256i _mm256_dbsad_epu8(__m256i a, __m256i b);</p>
<p>VDBPSADBW __m256i _mm256_mask_dbsad_epu8(__m256i s, __mmask16 m, __m256i a, __m256i b);</p>
<p>VDBPSADBW __m256i _mm256_maskz_dbsad_epu8(__mmask16 m, __m256i a, __m256i b);</p>
<p>VDBPSADBW __m128i _mm_dbsad_epu8(__m128i a, __m128i b);</p>
<p>VDBPSADBW __m128i _mm_mask_dbsad_epu8(__m128i s, __mmask8 m, __m128i a, __m128i b);</p>
<p>VDBPSADBW __m128i _mm_maskz_dbsad_epu8(__mmask8 m, __m128i a, __m128i b);</p>
<p><strong>SIMD Floating-Point Exceptions</strong></p>
<p>None</p>
<p><strong>Other Exceptions</strong></p>
<p>See Exceptions Type E4NF.nb.</p></body></html>